Some superpopulation models for estimating the number of population uniques
Abstract
The number of unique individuals in the population is of great importance in evaluating the disclosure risk of a microdata set. We approach this problem by considering some basic superpopulation models, including the gamma-Poisson model of Bethlehem et al. (1990). We introduce the Dirichlet-multinomial model, which is closely related to the gamma-Poisson model but more basic, in the sense that the binomial distribution is more basic than the Poisson distribution. We also discuss the Ewens model and show that it can be obtained from the Dirichlet-multinomial model by a limiting argument similar to the law of small numbers. The multivariate Ewens distribution is a basic mathematical model used in genetics. Estimation of the number of population uniques is particularly simple under the Ewens model. Although these models might not necessarily fit actual populations well, they can be considered basic mathematical models for our problem, just as the binomial and Poisson distributions are considered basic models for count data.
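To illustrate why estimation is simple under the Ewens model, the following is a minimal sketch based on standard properties of the Ewens sampling formula, not necessarily the estimator developed in the paper. It assumes that a simple random sample of size n from an Ewens population of size N follows the Ewens distribution with the same parameter θ, that the expected number of distinct cells is Σ_{i=0}^{n-1} θ/(θ+i), and that the expected number of uniques in a population of size N is θN/(θ+N−1). The function names are hypothetical and chosen only for this example.

```python
from math import fsum


def expected_distinct_cells(theta, n):
    """Expected number of occupied cells among n individuals under the Ewens model."""
    return fsum(theta / (theta + i) for i in range(n))


def estimate_theta(k, n, tol=1e-10):
    """Estimate theta by solving expected_distinct_cells(theta, n) = k with bisection.

    k is the observed number of distinct cells in a sample of size n.
    A finite positive solution exists only when 1 < k < n.
    """
    if not 1 < k < n:
        raise ValueError("need 1 < k < n for a finite positive estimate")
    lo, hi = 1e-12, 1.0
    # Expand the upper bracket until the expected count reaches k.
    while expected_distinct_cells(hi, n) < k:
        hi *= 2.0
    # The expected count is increasing in theta, so bisection applies.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if expected_distinct_cells(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expected_population_uniques(theta, N):
    """Expected number of cells of size one in an Ewens population of size N."""
    return theta * N / (theta + N - 1)


if __name__ == "__main__":
    # Hypothetical figures: 300 distinct key combinations in a sample of 1,000,
    # drawn from a population of 100,000.
    theta_hat = estimate_theta(k=300, n=1000)
    print(theta_hat, expected_population_uniques(theta_hat, N=100_000))
```

Only the number of distinct cells observed in the sample is needed to estimate θ, and the plug-in estimate of the number of population uniques then follows in closed form.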
Similar resources
Estimating Identification Disclosure Risk Using Mixed Membership Models.
Statistical agencies and other organizations that disseminate data are obligated to protect data subjects' confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence,...
Assessing Microdata Disclosure Risk Using the Poisson-inverse Gaussian Distribution
An important measure of identification risk associated with the release of microdata or large complex tables is the number or proportion of population units that can be uniquely identified by some set of characterizing attributes which partition the population into subpopulations or cells. Various methods for estimating this quantity based on sample data have been proposed in the literature by ...
Methods for Stratified Cluster Sampling with Informative Stratification
We look at fitting regression models using data from stratified cluster samples when the strata may depend in some way on the observed responses within clusters. One important subclass of examples is that of family studies in genetic epidemiology, where the probability of selecting a family into the study depends on the incidence of disease within the family. We develop the survey-weighted esti...
Research Article Methods for Stratified Cluster Sampling with Informative Stratification
We look at fitting regression models using data from stratified cluster samples when the strata may depend in some way on the observed responses within clusters. One important subclass of examples is that of family studies in genetic epidemiology, where the probability of selecting a family into the study depends on the incidence of disease within the family. We develop the survey-weighted esti...
Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling
INTRODUCTION For estimation of population characteristics (mainly totals, means, counts) in business statistics surveys, the Czech Statistical Office (CZSO) has been recently exploring a new approach, in which all data for units that are out of the sample are imputed based on predictions by regression, instead of estimating the population characteristics through weighting. The all-data imputati...